fix(robot-server): fail proto load gracefully #18259
Merged
Conversation
SyntaxColoring approved these changes on May 5, 2025
SyntaxColoring (Contributor) left a comment:

Affirming my former (accidental) approval with this (intentional) approval. Thank you!
"Loading" a protocol — our generic term for everything required to get a protocol ready to run — can now fail. We hadn't considered this failure mode. When it happened, we would have created some but not all of our run-tracking resources, and our API would return inconsistent results (which is the bug). The orchestrator store saved the orchestrator object in a member as soon as it was created, so the store reported a "current run"; but because we didn't link resource allocation (or at least binding) to initialization, an exception raised after that point meant the run never went into the database.

The immediate fix is to not bind the orchestrator object to the _orchestrator data member until right before we return, after all of the possibly-failing orchestration is done.

The way to make this fail is a Python protocol that is syntactically correct (it passes ast.parse()) but does something at file scope that raises an error — an import failure, say; a division by zero would work too. This fails when we exec the protocol to check for the presence of runtime parameters, which happens during the execution of this function, triggering the bug.

Closes EXEC-1450
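The deferred-binding fix described above can be sketched in isolation. Note this is an illustrative model, not the actual robot-server code: the names `OrchestratorStore`, `LoadError`, `_load_protocol`, and `current_run_id` are hypothetical stand-ins for the real classes and members.

```python
class LoadError(Exception):
    """Stand-in for any exception raised while loading a protocol."""


class OrchestratorStore:
    """Sketch of the store: the member is bound only after a fully successful load."""

    def __init__(self):
        self._orchestrator = None

    @property
    def current_run_id(self):
        """A "current run" exists only if an orchestrator is bound."""
        return self._orchestrator["run_id"] if self._orchestrator else None

    def create(self, run_id, protocol_source):
        # Build everything in a local first; any step here may raise.
        orchestrator = {"run_id": run_id}
        self._load_protocol(orchestrator, protocol_source)  # may raise LoadError
        # Bind to the data member only after all possibly-failing work is
        # done, so a failed load never leaves a phantom "current run".
        self._orchestrator = orchestrator
        return orchestrator

    def _load_protocol(self, orchestrator, protocol_source):
        # Placeholder for the real (possibly-failing) protocol load.
        if protocol_source == "bad":
            raise LoadError("protocol failed at file scope")
```

With binding deferred, a failed `create` leaves the store with no current run, so the API stays consistent with the (absent) database row.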
Force-pushed from 7be77cb to 1bd8021
Codecov Report: All modified and coverable lines are covered by tests ✅

@@           Coverage Diff            @@
##             edge  #18259   +/-   ##
=======================================
  Coverage   23.62%  23.62%
=======================================
  Files        3055    3055
  Lines      255734  255734
  Branches    30114   30114
=======================================
  Hits        60419   60419
  Misses     195301  195301
  Partials       14      14
ddcc4 pushed a commit that referenced this pull request on May 16, 2025:
"Loading" a protocol — our generic term for everything required to get a protocol ready to run — can now fail. We hadn't considered this failure mode. When it happened, we would have created some but not all of our run-tracking resources, and our API would return inconsistent results (which is the bug). The orchestrator store saved the orchestrator object in a member as soon as it was created, so the store reported a "current run"; but because we didn't link resource allocation (or at least binding) to initialization, an exception raised after that point meant the run never went into the database.

The immediate fix is to not bind the orchestrator object to the _orchestrator data member until right before we return, after all of the possibly-failing orchestration is done.

The way to make this fail is a Python protocol that is syntactically correct (it passes ast.parse()) but does something at file scope that raises an error — an import failure, say; a division by zero would work too. This fails when we exec the protocol to check for the presence of runtime parameters, which happens during the execution of this function, triggering the bug. Luckily this is one of those "the error is presented badly" type of problems, so it doesn't have to be prioritized.

Closes EXEC-1450
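The triggering condition — a file that passes ast.parse() but raises when exec'd — can be demonstrated on its own. The protocol source below is a made-up example, not one of the actual repro files:

```python
import ast

# A "protocol" that is syntactically valid but raises at file scope.
BAD_PROTOCOL = "x = 1 / 0  # valid syntax; executing it raises ZeroDivisionError\n"


def parses_ok(source: str) -> bool:
    """ast.parse only checks syntax; it never runs the code."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def execs_ok(source: str) -> bool:
    """exec actually runs the file scope, so runtime errors only surface here."""
    try:
        exec(compile(source, "<protocol>", "exec"), {})
        return True
    except Exception:
        return False
```

For this source, `parses_ok` returns True while `execs_ok` returns False, which is exactly the gap between the syntax check and the runtime-parameter exec that exposed the bug.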
ddcc4
pushed a commit
that referenced
this pull request
May 29, 2025
"Loading" a protocol, which is what we generically call the process to do everything required to get it to run, can now fail. We didn't think about this failure mode, and if it happened then we would have created some but not all of our run tracking resources, and our API would return inconsistent results (which is the bug). The orchestrator store would save the orchestrator object in a member, which then means that it would have a "current run", but since we didn't link resource allocation (or at least binding) and initialization, the exception raised after that happens would mean it never went into the database. The immediate fix is to not bind the orchestrator object to the _orchestrator data member until right before we return, after we do all the possibly-failing orchestration. The thing that would make this fail is to have a python protocol that was syntactically correct (passes ast.parse()) but did something at the file scope that would cause an error (like an import failure - a division by 0 would work too). This will fail when we exec the protocol to check for the presence of runtime parameters, which happens during the execution of this function, triggering the bug. Luckily this is one of those "the error is presented badly" type of problems so it doesn't have to be prioritized. Closes EXEC-1450 (cherry picked from commit fe8d402)
ddcc4
pushed a commit
that referenced
this pull request
May 29, 2025
"Loading" a protocol, which is what we generically call the process to do everything required to get it to run, can now fail. We didn't think about this failure mode, and if it happened then we would have created some but not all of our run tracking resources, and our API would return inconsistent results (which is the bug). The orchestrator store would save the orchestrator object in a member, which then means that it would have a "current run", but since we didn't link resource allocation (or at least binding) and initialization, the exception raised after that happens would mean it never went into the database. The immediate fix is to not bind the orchestrator object to the _orchestrator data member until right before we return, after we do all the possibly-failing orchestration. The thing that would make this fail is to have a python protocol that was syntactically correct (passes ast.parse()) but did something at the file scope that would cause an error (like an import failure - a division by 0 would work too). This will fail when we exec the protocol to check for the presence of runtime parameters, which happens during the execution of this function, triggering the bug. Luckily this is one of those "the error is presented badly" type of problems so it doesn't have to be prioritized. Closes EXEC-1450 (cherry picked from commit fe8d402)
ddcc4
pushed a commit
that referenced
this pull request
May 29, 2025
"Loading" a protocol, which is what we generically call the process to do everything required to get it to run, can now fail. We didn't think about this failure mode, and if it happened then we would have created some but not all of our run tracking resources, and our API would return inconsistent results (which is the bug). The orchestrator store would save the orchestrator object in a member, which then means that it would have a "current run", but since we didn't link resource allocation (or at least binding) and initialization, the exception raised after that happens would mean it never went into the database. The immediate fix is to not bind the orchestrator object to the _orchestrator data member until right before we return, after we do all the possibly-failing orchestration. The thing that would make this fail is to have a python protocol that was syntactically correct (passes ast.parse()) but did something at the file scope that would cause an error (like an import failure - a division by 0 would work too). This will fail when we exec the protocol to check for the presence of runtime parameters, which happens during the execution of this function, triggering the bug. Luckily this is one of those "the error is presented badly" type of problems so it doesn't have to be prioritized. Closes EXEC-1450 (cherry picked from commit fe8d402)
"Loading" a protocol, which is our generic term for everything required to get it ready to run, can fail. We hadn't accounted for this failure mode: if it happened, we would have created some but not all of our run tracking resources, and our API would return inconsistent results (which is the bug). The orchestrator store saved the orchestrator object in a member, which means it would report a "current run"; but since we didn't link resource allocation (or at least binding) to initialization, an exception raised after that point meant the run never went into the database.
The immediate fix is to not bind the orchestrator object to the _orchestrator data member until right before we return, after all of the possibly-failing orchestration is done.
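The bind-late pattern described above can be sketched as follows. `Orchestrator`, `OrchestratorStore`, and their methods here are illustrative stand-ins, not the actual robot-server API:

```python
class Orchestrator:
    """Illustrative stand-in for the run orchestrator."""

    def __init__(self, protocol_source: str) -> None:
        self.protocol_source = protocol_source

    def load(self) -> None:
        # Simulate the fallible "load" step: executing the protocol
        # can raise even if the source parsed cleanly.
        exec(compile(self.protocol_source, "<protocol>", "exec"))


class OrchestratorStore:
    """Illustrative stand-in for the store that tracks the 'current run'."""

    def __init__(self) -> None:
        self._orchestrator = None

    @property
    def current_orchestrator(self):
        return self._orchestrator

    def add(self, protocol_source: str) -> Orchestrator:
        # Do every possibly-failing step on a local variable first...
        orchestrator = Orchestrator(protocol_source)
        orchestrator.load()  # may raise
        # ...and bind to the member only once nothing else can fail,
        # so a failed load never leaves a phantom "current run".
        self._orchestrator = orchestrator
        return orchestrator
```

If `load()` raises, the store's member is never assigned, so the store stays consistent with the database: no half-created run.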
The way to make this fail is a Python protocol that is syntactically correct (it passes ast.parse()) but does something at file scope that raises an error (an import failure; a division by zero would work too). That raises when we exec the protocol to check for the presence of runtime parameters, which happens during the execution of this function, triggering the bug.
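A minimal illustration of why such a protocol slips past parsing but fails at exec time (the module name is made up for the example):

```python
import ast

# Syntactically valid source that fails at file scope when executed.
source = "import does_not_exist_module\n"

ast.parse(source)  # succeeds: the syntax is fine

try:
    exec(compile(source, "<protocol>", "exec"))
except ModuleNotFoundError:
    # This is the point where "loading" the protocol blows up,
    # after parsing has already succeeded.
    pass
```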
Luckily this is one of those "the error is presented badly" problems, so it doesn't have to be prioritized.
Closes EXEC-1450